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Abstract 

This  paper  presents  a  statistical  decision  procedure  for 
lexical  ambiguity  resolution.  The  algorithm  exploits 
both  local  syntactic  patterns  and  more  distant  collo¬ 
cational  evidence,  generating  an  efficient,  effective,  and 
highly  perspicuous  recipe  for  resolving  a  given  ambigu¬ 
ity.  By  identifying  and  utilizing  only  the  single  best  dis¬ 
ambiguating  evidence  in  a  target  context,  the  algorithm 
avoids  the  problematic  complex  modeling  of  statistical 
dependencies.  Although  directly  applicable  to  a  wide 
class  of  ambiguities,  the  algorithm  is  described  and  eval¬ 
uated  in  a  realistic  case  study,  the  problem  of  restoring 
missing  accents  in  Spanish  and  French  text.  Current 
accuracy  exceeds  99%  on  the  full  task,  and  typically  is 
over  90%  for  even  the  most  difficult  ambiguities. 

INTRODUCTION 

This  paper  presents  a  general-purpose  statistical  deci¬ 
sion  procedure  for  lexical  ambiguity  resolution  based  on 
decision  lists  (Rivest,  1987).  The  algorithm  considers 
multiple  types  of  evidence  in  the  context  of  an  ambigu¬ 
ous  word,  exploiting  differences  in  collocational  distri¬ 
bution  as  measured  by  log-likelihoods.  Unlike  standard 
Bayesian  approaches,  however,  it  does  not  combine  the 
log-likelihoods  of  all  available  pieces  of  contextual  evi¬ 
dence,  but  bases  its  classifications  solely  on  the  single 
most  reliable  piece  of  evidence  identified  in  the  target 
context.  Perhaps  surprisingly,  this  strategy  appears  to 
yield  the  same  or  even  slightly  better  precision  than 
the  combination  of  evidence  approach  when  trained  on 
the  same  features.  It  also  brings  with  it  several  ad¬ 
ditional  advantages,  the  greatest  of  which  is  the  abil¬ 
ity  to  include  multiple,  highly  non-independent  sources 
of  evidence  without  complex  modeling  of  dependencies. 
Some  other  advantages  are  significant  simplicity  and 
ease  of  implementation,  transparent  understandability 
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of  the  resulting  decision  list,  and  easy  adaptability  to 
new  domains.  The  particular  domain  chosen  here  as  a 
case  study  is  the  problem  of  restoring  missing  accents1 
to  Spanish  and  French  text.  Because  it  requires  the  res¬ 
olution  of  both  semantic  and  syntactic  ambiguity,  and 
offers  an  objective  ground  truth  for  automatic  evalua¬ 
tion,  it  is  particularly  well  suited  for  demonstrating  and 
testing  the  capabilities  of  the  given  algorithm.  It  is  also 
a  practical  problem  with  immediate  application. 

PROBLEM  DESCRIPTION 

The  general  problem  considered  here  is  the  resolu¬ 
tion  of  lexical  ambiguity,  both  syntactic  and  seman¬ 
tic,  based  on  properties  of  the  surrounding  context. 
Accent  restoration  is  merely  an  instance  of  a  closely- 
related  class  of  problems  including  word-sense  disam¬ 
biguation,  word  choice  selection  in  machine  translation, 
homograph  and  homophone  disambiguation,  and  capi¬ 
talization  restoration.  The  given  algorithm  may  be  used 
to  solve  each  of  these  problems,  and  has  been  applied 
without  modification  to  the  case  of  homograph  disam¬ 
biguation  in  speech  synthesis  (Sproat,  Hirschberg  and 
Yarowsky,  1992). 

It  may  not  be  immediately  apparent  to  the  reader 
why  this  set  of  problems  forms  a  natural  class,  similar 
in  origin  and  solvable  by  a  single  type  of  algorithm.  In 
each  case  it  is  necessary  to  disambiguate  two  or  more 
semantically  distinct  word-forms  which  have  been  con¬ 
flated  into  the  same  representation  in  some  medium. 

In  the  prototypical  instance  of  this  class,  word- 
sense  disambiguation,  such  distinct  semantic  concepts 
as  river  bank,  financial  bank  and  to  bank  an  airplane  are 
conflated  in  ordinary  text.  Word  associations  and  syn¬ 
tactic  patterns  are  sufficient  to  identify  and  label  the 
correct  form.  In  homophone  disambiguation,  distinct 
semantic  concepts  such  as  ceiling  and  sealing  have  also 
become  represented  by  the  same  ambiguous  form,  but 
in  the  medium  of  speech  and  with  similar  disambiguat¬ 
ing  clues. 

Capitalization  restoration  is  a  similar  problem  in  that 
distinct  semantic  concepts  such  as  AIDS/ aids  (disease 
or  helpful  tools)  and  Bush/  bush  (president  or  shrub) 

Tor  brevity,  the  term  accent  will  typically  refer  to  the 
general  class  of  accents  and  other  diacritics,  including  e,e,e, 6 
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are  ambiguous,  but  in  the  medium  of  all-capitalized  (or 
casefree)  text,  which  includes  titles  and  the  beginning 
of  sentences.  Note  that  what  was  once  just  a  capital¬ 
ization  ambiguity  between  Prolog  (computer  language) 
and  prolog  (introduction)  has  is  becoming  a  “sense”  am¬ 
biguity  since  the  computer  language  is  now  often  writ¬ 
ten  in  lower  case,  indicating  the  fundamental  similarity 
of  these  problems. 

Accent  restoration  involves  lexical  ambiguity,  such 
as  between  the  concepts  cote  (coast)  and  edit  (side), 
in  textual  mediums  where  accents  are  missing.  It  is 
traditional  in  Spanish  and  French  for  diacritics  to  be 
omitted  from  capitalized  letters.  This  is  particularly  a 
problem  in  all-capitalized  text  such  as  headlines.  Ac¬ 
cents  in  on-line  text  may  also  be  systematically  stripped 
by  many  computational  processes  which  are  not  8-bit 
clean  (such  as  some  e-mail  transmissions),  and  may  be 
routinely  omitted  by  Spanish  and  French  typists  in  in¬ 
formal  computer  correspondence. 

Missing  accents  may  create  both  semantic  and  syn¬ 
tactic  ambiguities,  including  tense  or  mood  distinctions 
which  may  only  be  resolved  by  distant  temporal  mark¬ 
ers  or  non-syntactic  cues.  The  most  common  accent 
ambiguity  in  Spanish  is  between  the  endings  -o  and 
-6,  such  as  in  the  case  of  completo  vs.  completo.  This 
is  a  present/preterite  tense  ambiguity  for  nearly  all 
-ar  verbs,  and  very  often  also  a  part  of  speech  ambi¬ 
guity,  as  the  -o  form  is  a  frequently  a  noun  as  well. 
The  second  most  common  general  ambiguity  is  between 
the  past-subjunctive  and  future  tenses  of  nearly  all  -ar 
verbs  (eg:  terminara  vs.  terminara),  both  of  which 
are  3rd  person  singular  forms.  This  is  a  particularly 
challenging  class  and  is  not  readily  amenable  to  tradi¬ 
tional  part-of-speech  tagging  algorithms  such  as  local 
trigram-based  taggers.  Some  purely  semantic  ambigui¬ 
ties  include  the  nouns  secrelana  (secretary)  vs.  secre- 
tari'a  (secretariat),  sabana  (grassland)  vs.  sabana  (bed 
sheet),  and  politico,  (female  politician)  vs.  politico  (pol¬ 
itics).  The  distribution  of  ambiguity  types  in  French  is 
similar.  The  most  common  case  is  between  -e  and  -e, 
which  is  both  a  past  participle/present  tense  ambigu¬ 
ity,  and  often  a  part-of-speech  ambiguity  (with  nouns 
and  adjectives)  as  well.  Purely  semantic  ambiguities  are 
more  common  than  in  Spanish,  and  include  traite/traite 
(treaty/draft),  marche/ marche  (step/market),  and  the 
cote  example  mentioned  above. 

Accent  restoration  provides  several  advantages  as  a 
case  study  for  the  explication  and  evaluation  of  the  pro¬ 
posed  decision-list  algorithm.  First,  as  noted  above,  it 
offers  a  broad  spectrum  of  ambiguity  types,  both  syn¬ 
tactic  and  semantic,  and  shows  the  ability  of  the  algo¬ 
rithm  to  handle  these  diverse  problems.  Second,  the 
correct  accent  pattern  is  directly  recoverable:  unlim¬ 
ited  quantities  of  test  material  may  be  constructed  by 
stripping  the  accents  from  correctly-accented  text  and 
then  using  the  original  as  a  fully  objective  standard 
for  automatic  evaluation.  By  contrast,  in  traditional 
word-sense  disambiguation,  hand-labeling  training  and 
test  data  is  a  laborious  and  subjective  task.  Third,  the 
task  of  restoring  missing  accents  and  resolving  ambigu¬ 


ous  forms  shows  considerable  commercial  applicability, 
both  as  a  stand-alone  application  or  part  of  the  front- 
end  to  NLP  systems.  There  is  also  a  large  potential 
commercial  market  in  its  use  in  grammar  and  spelling 
correctors,  and  in  aids  for  inserting  the  proper  diacrit¬ 
ics  automatically  when  one  types2.  Thus  while  accent 
restoration  may  not  be  be  the  prototypical  member  of 
the  class  of  lexical-ambiguity  resolution  problems,  it  is 
an  especially  useful  one  for  describing  and  evaluating  a 
proposed  solution  to  this  class  of  problems. 

PREVIOUS  WORK 

The  problem  of  accent  restoration  in  text  has  received 
minimal  coverage  in  the  literature,  especially  in  En¬ 
glish,  despite  its  many  interesting  aspects.  Most  work 
in  this  area  appears  to  done  in  the  form  of  in-house 
or  commercial  software,  so  for  the  most  part  the  prob¬ 
lem  and  its  potential  solutions  are  without  comprehen¬ 
sive  published  analysis.  The  best  treatment  I’ve  discov¬ 
ered  is  from  Fernand  Marty  (1986,  1992),  who  for  more 
than  a  decade  has  been  painstakingly  crafting  a  system 
which  includes  accent  restoration  as  part  of  a  compre¬ 
hensive  system  of  syntactic,  morphological  and  phonetic 
analysis,  with  an  intended  application  in  French  text- 
to-speech  synthesis.  He  incorporates  information  ex¬ 
tracted  from  several  French  dictionaries  and  uses  basic 
collocational  and  syntactic  evidence  in  hand-built  rules 
and  heuristics.  While  the  scope  and  complexity  of  this 
effort  is  remarkable,  this  paper  will  focus  on  a  solution 
to  the  problem  which  requires  considerably  less  effort 
to  implement. 

The  scope  of  work  in  lexical  ambiguity  resolution  is 
very  large.  Thus  in  the  interest  of  space,  discussion 
will  focus  on  the  direct  historic  precursors  and  sources 
of  inspiration  for  the  approach  presented  here.  The 
central  tradition  from  which  it  emerges  is  that  of  the 
Bayesian  classifier  (Mosteller  and  Wallace,  1964).  This 
was  expanded  upon  by  (Gale  et  al.,  1992),  and  in  a 
class-based  variant  by  (Yarowsky,  1992).  Decision  trees 
(Brown,  1991)  have  been  usefully  applied  to  word-sense 
ambiguities,  and  IIMM  part-of-speech  taggers  (Jelinek 
1985,  Church  1988,  Merialdo  1990)  have  addressed  the 
syntactic  ambiguities  presented  here.  Hearst  (1991) 
presented  an  effective  approach  to  modeling  local  con¬ 
textual  evidence,  while  Resnik  (1993)  gave  a  classic 
treatment  of  the  use  of  word  classes  in  selectional  con¬ 
straints.  An  algorithm  for  combining  syntactic  and  se¬ 
mantic  evidence  in  lexical  ambiguity  resolution  has  been 
realized  in  (Chang  et  al.,  1992).  A  particularly  success¬ 
ful  algorithm  for  integrating  a  wide  diversity  of  evidence 
types  using  error  driven  learning  was  presented  in  Brill 
(1993).  While  it  has  been  applied  primarily  to  syntac¬ 
tic  problems,  it  shows  tremendous  promise  for  equally 
impressive  results  in  the  area  of  semantic  ambiguity  res¬ 
olution. 

2  Such  a  tool  would  particularly  useful  for  typing  Spanish 
or  French  on  Anglo-centric  computer  keyboards,  where  en¬ 
tering  accents  and  other  diacritic  marks  every  few  keystrokes 
can  be  laborious. 
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The  formal  model  of  decision  lists  was  presented  in 
(Rivest,  1987).  I  have  restricted  feature  conjuncts  to  a 
much  narrower  complexity  than  allowed  in  the  original 
model  -  namely  to  word  and  class  trigrams.  The  current 
approach  was  initially  presented  in  (Sproat  et  al.,  1992), 
applied  to  the  problem  of  homograph  resolution  in  text- 
to-speech  synthesis.  The  algorithm  achieved  97%  mean 
accuracy  on  a  disambiguation  task  involving  a  sample 
of  13  homographs3. 

ALGORITHM 

Step  1:  Identify  the  Ambiguities  in  Accent 
Pattern 

Most  words  in  Spanish  and  French  exhibit  only  one  ac¬ 
cent  pattern.  Basic  corpus  analysis  will  indicate  which 
is  the  most  common  pattern  for  each  word,  and  may  be 
used  in  conjunction  with  or  independent  of  dictionaries 
and  other  lexical  resources. 

The  initial  step  is  to  take  a  histogram  of  a  corpus  with 
accents  and  diacritics  retained,  and  compute  a  table  of 
accent  pattern  distributions  as  follows: 


De-accented  Form 

Accent  Pattern 

ST1 

Number 

cesse 

cesse 

53% 

669 

cesse 

47% 

593 

cout 

cout 

100% 

330 

couta 

couta 

"100%  1 

41 

coute 

coute 

53%  1 

107 

coute 

47% 

96 

cote 

cote 

69% 

2645 

cote 

28% 

1040 

cote 

3% 

99 

cote 

<1% 

15 

cotiere 

cotiere 

~T00%“ 
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For  words  with  multiple  accent  patterns,  steps  2-5 
are  applied. 

Step  2:  Collect  Training  Contexts 

For  a  particular  case  of  accent  ambiguity  identified 
above,  collect  ±fc  words  of  context  around  all  occur¬ 
rences  in  the  corpus,  label  the  concordance  line  with 
the  observed  accent  pattern,  and  then  strip  the  accents 
from  the  data.  This  will  yield  a  training  set  such  as  the 
following: 


Pattern 

Context 

(1)  cote 
(l)  cote 
(1)  cote 

du  laisser  de  cote  faute  de  temps 
appeler  1’  autre  cote  de  1’  atlantique 
passe  de  notre  cote  de  la  frontiere 

(2)  cote 
(2)  cote 
(2)  cote 

vivre  sur  notre  cote  ouest  toujours  verte 
creer  sur  la  cote  du  labrador  des 
travaillaient  cote  a  cote  ,  ils  avaient 

The  training  corpora  used  in  this  experiment  were  the 
Spanish  AP  Newswire  (1991-1993,  49  million  words), 

3  Baseline  accuracy  for  this  data  (using  the  most  common 
pronunciation)  is  67%. 


the  French  Canadian  Hansards  (1986-1988,  19  million 
words),  and  a  collection  from  Le  Monde  (1  million 
words). 

Step  3:  Measure  Collocational  Distributions 

The  driving  force  behind  this  disambiguation  algorithm 
is  the  uneven  distribution  of  collocations4  with  respect 
to  the  ambiguous  token  being  classified.  Certain  collo¬ 
cations  will  indicate  one  accent  pattern,  while  different 
collocations  will  tend  to  indicate  another.  The  goal  of 
this  stage  of  the  algorithm  is  to  measure  a  large  num¬ 
ber  of  collocational  distributions  to  select  those  which 
are  most  useful  in  identifying  the  accent  pattern  of  the 
ambiguous  word. 

The  following  are  the  initial  types  of  collocations  con¬ 
sidered: 

•  Word  immediately  to  the  right  (+1  W) 

•  Word  immediately  to  the  left  (-1  W) 

•  Word  found  in  ±fc  word  window5  (±k  W) 

•  Pair  of  words  at  offsets  -2  and  -1 

•  Pair  of  words  at  offsets  -1  and  +1 

•  Pair  of  words  at  offsets  +1  and  +2 

For  the  two  major  accent  patterns  of  the  French  word 
cote ,  below  is  a  small  sample  of  these  distributions  for 
several  types  of  collocations: 


Position 

Collocation 

cote 

cote 

-1  W 

du  cote 

0 

536 

la  cote 

766 

1 

un  cote 

0 

216 

notre  cote 

10 

70 

+1  w 

cote  ouest 

288 

1 

cote  est 

174 

3 

cote  du 

55 

156 

+lw,+2w 

cote  du  gouvernement 

0 

62 

-2w,-lw 

cote  a  cote 

23 

0 

±k  w 

poisson  (in  ±fc  words) 

20 

0 

±k  w 

ports  (in  ±fc  words) 

22 

0 

±k  w 

opposition  (in  ±k  words) 

o 

39 

This  core  set  of  evidence  presupposes  no  language- 
specific  knowledge.  However,  if  additional  language  re¬ 
sources  are  available,  it  may  be  desirable  to  include  a 
larger  feature  set.  For  example,  if  lemmatization  proce¬ 
dures  are  available,  collocational  measures  for  morpho¬ 
logical  roots  will  tend  to  yield  more  succinct  and  gener- 
alizable  evidence  than  measuring  the  distributions  for 
each  of  the  inflected  forms.  If  part-of-speech  informa¬ 
tion  is  available  in  a  lexicon,  it  is  useful  to  compute  the 

4The  term  collocation  is  used  here  in  its  broad  sense, 
meaning  words  appearing  adjacent  to  or  near  each  other 
(literally,  in  the  same  location),  and  does  not  imply  only 
idiomatic  or  non-composition  al  associations. 

sThe  optimal  value  of  k  is  sensitive  to  the  type  of  ambi¬ 
guity.  Semantic  or  topic-based  ambiguities  warrant  a  larger 
window  (fc  S3  20  —  50),  while  more  local  syntactic  ambiguities 
warrant  a  smaller  window  (fc  S3  3  or  4) 


90 


distributions  for  part-of-speech  bigrams  and  trigrams 
as  above.  Note  that  it’s  not  necessary  to  determine  the 
actual  parts-of-speech  of  words  in  context;  using  only 
the  most  likely  part  of  speech  or  a  set  of  all  possibil¬ 
ities  will  produce  adequate,  if  somewhat  diluted,  dis¬ 
tributional  evidence.  Similarly,  it  is  useful  to  compute 
collocational  statistics  for  arbitrary  word  classes,  such 
as  the  class  WEEKDAY  ={  domingo,  lunes,  martes,  ...  }. 
Such  classes  may  cover  many  types  of  associations,  and 
need  not  be  mutually  exclusive. 

For  the  French  experiments,  no  additional  linguistic 
knowledge  or  lexical  resources  were  used.  The  decision 
lists  were  trained  solely  on  raw  word  associations  with¬ 
out  additional  patterns  based  on  part  of  speech,  mor¬ 
phological  analysis  or  word  class.  Hence  the  reported 
performance  is  representative  of  what  may  be  achieved 
with  a  rapid,  inexpensive  implementation  based  strictly 
on  the  distributional  properties  of  raw  text. 

For  the  Spanish  experiments,  a  richer  set  of  evidence 
was  utilized.  Use  of  a  morphological  analyzer  (devel¬ 
oped  by  Tzoukermann  and  Liberman  (1990))  allowed 
distributional  measures  to  be  computed  for  associations 
of  lemmas  (morphological  roots),  improving  general¬ 
ization  to  different  inflected  forms  not  observed  in  the 
training  data.  Also,  a  basic  lexicon  with  possible  parts 
of  speech  (augmented  by  the  morphological  analyzer) 
allowed  adjacent  part-of-speech  sequences  to  be  used 
as  disambiguating  evidence.  A  relatively  coarse  level  of 
analysis  (e.g.  noun,  adjective,  subject-pronoun, 
article,  etc.),  augmented  with  independently  mod¬ 
eled  features  representing  gender,  person,  and  num¬ 
ber,  was  found  to  be  most  effective.  However,  when 
a  word  was  listed  with  multiple  parts-of-speech,  no  rel¬ 
ative  frequency  distribution  was  available.  Such  words 
were  given  a  part-of-speech  tag  consisting  of  the  union 
of  the  possibilities  (eg  ADJECTIVE-NOUN),  as  in  Ku- 
piec  (1989).  Thus  sequences  of  pure  part-of-speech  tags 
were  highly  reliable,  while  the  potential  sources  of  noise 
were  isolated  and  modeled  separately.  In  addition,  sev¬ 
eral  word  classes  such  as  WEEKDAY  and  MONTH  were 
defined,  primarily  focusing  on  time  words  because  so 
many  accent  ambiguities  involve  tense  distinctions. 

To  build  a  full  part  of  speech  tagger  for  Spanish  would 
be  quite  costly  (and  require  special  tagged  corpora). 
The  current  approach  uses  just  the  information  avail¬ 
able  in  dictionaries,  exploiting  only  that  which  is  useful 
for  the  accent  restoration  task.  Were  dictionaries  not 
available,  a  productive  approximation  could  have  been 
made  using  the  associational  distributions  of  suffixes 
(such  as  -aba,  -aste,  - amos )  which  are  often  satisfactory 
indicators  of  part  of  speech  in  morphologically  rich  lan¬ 
guages  such  as  Spanish. 

The  use  of  the  word-class  and  part-of-speech  data  is 
illustrated  below,  with  the  example  of  distinguishing 
ierminara/ terminara  (a  subjunctive/future  tense  am¬ 
biguity): 


Collocation 

termin¬ 

ara 

termin¬ 

ara 

PREPOSITION  que  terminara 

31 

0 

de  que  terminara 

15 

0 

para  que  terminara 

14 

0 

NOUN  que  terminara 

0 

13 

carrera  que  terminara 

0 

3 

reunion  que  terminara 

0 

2 

acuerdo  que  terminara 

0 

2 

que  terminara 

42 

37 

weekday  (within  ±k  words) 

0 

23 

domingo  (within  ±k  words) 

0 

10 

viernes  (within  ±k  words) 

0 

4 

Step  4:  Sort  by  Log-Likelihood  into 
Decision  Lists 

The  next  step  is  to  compute  the  ratio  called  the  log- 
likelihood: 


Abs(Log( 


P r (Accent -Patter ni  \C ollocationi) 
Pr(Accent-Pattern2\Collocationi) 


)) 


The  collocations  most  strongly  indicative  of  a  partic¬ 
ular  pattern  will  have  the  largest  log-likelihood.  Sorting 
by  this  value  will  list  the  strongest  and  most  reliable  ev¬ 
idence  first6. 

Evidence  sorted  in  the  above  manner  will  yield  a  deci¬ 
sion  list  like  the  following,  highly  abbreviated  example7: 


LogL 

Evidence  Classification 

8.28 
f7.24 
f7. 14 
6.87 
6.64 
5.82 
f5.45 

PREPOSITION  que  terminara  =>  terminara 
de  que  terminara  terminara 

para  que  terminara  =>■  terminara 

y  terminara  =$■  terminara 

weekday  (within  ±k  words)  =>  terminara 
NOUN  que  terminara  =>  terminara 

domingo  (within  ±k  words)  =>  terminara 

The  resulting  decision  list  is  used  to  classify  new  ex¬ 
amples  by  identifying  the  highest  line  in  the  list  that 
matches  the  given  context  and  returning  the  indicated 


6Problems  arise  when  an  observed  count  is  0.  Clearly 
the  probability  of  seeing  cote  in  the  context  of  poisson  is 
not  0,  even  though  no  such  collocation  was  observed  in  the 
training  data.  Finding  a  more  accurate  probability  estimate 
depends  on  several  factors,  including  the  size  of  the  train¬ 
ing  sample,  nature  of  the  collocation  (adjacent  bigrams  or 
wider  context),  our  prior  expectation  about  the  similarity 
of  contexts,  and  the  amount  of  noise  in  the  training  data. 
Several  smoothing  methods  have  been  explored  here,  includ¬ 
ing  those  discussed  in  (Gale  et  al.,  1992).  In  one  technique, 
all  observed  distributions  with  the  same  0-denominator  raw 
frequency  ratio  (such  as  2/0)  are  taken  collectively,  the  av¬ 
erage  agreement  rate  of  these  distributions  with  additional 
held-out  training  data  is  measured,  and  from  this  a  more 
realistic  estimate  of  the  likelihood  ratio  (e.g.  1.8/0. 2)  is 

computed.  However,  in  the  simplest  implementation,  satis¬ 
factory  results  may  be  achieved  by  adding  a  small  constant 
a  to  the  numerator  and  denominator,  where  a  is  selected 
empirically  to  optimize  classification  performance.  For  this 
data,  relatively  small  a  (between  0.1  and  0.25)  tended  to  be 
effective,  while  noisier  training  data  warrant  larger  a. 

7 Entries  marked  with  f  are  pruned  in  Step  5,  below. 


91 


classification.  See  Step  7  for  a  full  description  of  this 
process. 

Step  5:  Optional  Pruning  and  Interpolation 

A  potentially  useful  optional  procedure  is  the  interpo¬ 
lation  of  log-likelihood  ratios  between  those  computed 
from  the  full  data  set  (the  global  probabilities)  and  those 
computed  from  the  residual  training  data  left  at  a  given 
point  in  the  decision  list  when  all  higher-ranked  pat¬ 
terns  failed  to  match  (i.e.  the  residual  probabilities). 
The  residual  probabilities  are  more  relevant,  but  since 
the  size  of  the  residual  training  data  shrinks  at  each 
level  in  the  list,  they  are  often  much  more  poorly  es¬ 
timated  (and  in  many  cases  there  may  be  no  relevant 
data  left  in  the  residual  on  which  to  compute  the  dis¬ 
tribution  of  accent  patterns  for  a  given  collocation).  In 
contrast,  the  global  probabilities  are  better  estimated 
but  less  relevant.  A  reasonable  compromise  is  to  inter¬ 
polate  between  the  two,  where  the  interpolated  estimate 
is  f3  x  global  +  7  X  residual.  When  the  residual  proba¬ 
bilities  are  based  on  a  large  training  set  and  are  well  es¬ 
timated,  7  should  dominate,  while  in  cases  the  relevant 
residual  is  small  or  non-existent,  (3  should  dominate. 
If  always  f3  =  0  and  7=1  (exclusive  use  of  the  resid¬ 
ual),  the  result  is  a  degenerate  (strictly  right-branching) 
decision  tree  with  severe  sparse  data  problems.  Alter¬ 
nately,  if  one  assumes  that  likelihood  ratios  for  a  given 
collocation  are  functionally  equivalent  at  each  line  of  a 
decision  list,  then  one  could  exclusively  use  the  global 
(always  /3  =  1  and  7  =  0).  This  is  clearly  the  easiest 
and  fastest  approach,  as  probability  distributions  do 
not  need  to  be  recomputed  as  the  list  is  constructed. 
Which  approach  is  best?  Using  only  the  global  proa¬ 
bilities  does  surprisingly  well,  and  the  results  cited  here 
are  based  on  this  readily  replicatable  procedure.  The 
reason  is  grounded  in  the  strong  tendency  of  a  word  to 
exhibit  only  one  sense  or  accent  pattern  per  collocation 
(discussed  in  Step  6  and  (Yarowsky,  1993)).  Most  clas¬ 
sifications  are  based  onai  vs.  0  distribution,  and  while 
the  magnitude  of  the  log-likelihood  ratios  may  decrease 
in  the  residual,  they  rarely  change  sign.  There  are  cases 
where  this  does  happen  and  it  appears  that  some  inter¬ 
polation  helps,  but  for  this  problem  the  relatively  small 
difference  in  performance  does  not  seem  to  justify  the 
greatly  increased  computational  cost. 

Two  kinds  of  optional  pruning  can  also  increase  the 
efficiency  of  the  decision  lists.  The  first  handles  the 
problem  of  “redundancy  by  subsumption,”  which  is 
clearly  visible  in  the  example  decision  lists  above  (in 
WEEKDAY  and  domingo).  When  lemmas  and  word- 
classes  precede  their  member  words  in  the  list,  the  latter 
will  be  ignored  and  can  be  pruned.  If  a  bigram  is  un¬ 
ambiguous,  probability  distributions  for  dependent  tri¬ 
grams  will  not  even  be  generated,  since  they  will  provide 
no  additional  information. 

The  second,  pruning  in  a  cross-validation  phase,  com¬ 
pensates  for  the  minimal  observed  over-modeling  of  the 
data.  Once  a  decision  list  is  built  it  is  applied  to  its  own 
training  set  plus  some  held-out  cross-validation  data 
(not  the  test  data).  Lines  in  the  list  which  contribute 


to  more  incorrect  classifications  than  correct  ones  are 
removed.  This  also  indirectly  handles  problems  that 
may  result  from  the  omission  of  the  interpolation  step. 
If  space  is  at  a  premium,  lines  which  are  never  used  in 
the  cross-validation  step  may  also  be  pruned.  However, 
useful  information  is  lost  here,  and  words  pruned  in  this 
way  may  have  contributed  to  the  classification  of  test¬ 
ing  examples.  A  3%  drop  in  performance  is  observed, 
but  an  over  90%  reduction  in  space  is  realized.  The  op¬ 
timum  pruning  strategy  is  subject  to  cost-benefit  anal¬ 
ysis.  In  the  results  reported  below,  all  pruning  except 
this  final  space-saving  step  was  utilized. 

Step  6:  Train  Decision  Lists  for  General 
Classes  of  Ambiguity 

For  many  similar  types  of  ambiguities,  such  as  the  Span¬ 
ish  subjunctive/future  distinction  between  -ara  and 
ara,  the  decision  lists  for  individual  cases  will  be  quite 
similar  and  use  the  same  basic  evidence  for  the  classifi¬ 
cation  (such  as  presence  of  nearby  time  adverbials).  It 
is  useful  to  build  a  general  decision  list  for  all  -ara/ ara 
ambiguities.  This  also  tends  to  improve  performance 
on  words  for  which  there  is  inadequate  training  data 
to  build  a  full  individual  decision  lists.  The  process 
for  building  this  general  class  disambiguator  is  basically 
identical  to  that  described  in  Steps  2-5  above,  except 
that  in  Step  2,  training  contexts  are  pooled  for  all  in¬ 
dividual  instances  of  the  class  (such  as  all  -ara/ -ara 
ambiguities).  It  is  important  to  give  each  individual  - 
ara  word  roughly  equal  representation  in  the  training 
set,  however,  lest  the  list  model  the  idiosyncrasies  of 
the  most  frequent  class  members,  rather  than  identify 
the  shared  common  features  representative  of  the  full 
class. 

In  Spanish,  decision  lists  are  trained  for  the  general 
ambiguity  classes  including  -0/-0,  -e/-e,  -ara/ -ara,  and 
-aranj -ardn.  For  each  ambiguous  word  belonginging  to 
one  of  these  classes,  the  accuracy  of  the  word-specific 
decision  list  is  compared  with  the  class-based  list.  If  the 
class’s  list  performs  adequately  it  is  used.  Words  with 
idiosyncrasies  that  are  not  modeled  well  by  the  class’s 
list  retain  their  own  word-specific  decision  list. 

Step  7:  Using  the  Decision  Lists 

Once  these  decision  lists  have  been  created,  they  may 
be  used  in  real  time  to  determine  the  accent  pattern  for 
ambiguous  words  in  new  contexts. 

At  run  time,  each  word  encountered  in  a  text  is 
looked  up  in  a  table.  If  the  accent  pattern  is  unam¬ 
biguous,  as  determined  in  Step  1,  the  correct  pattern 
is  printed.  Ambiguous  words  have  a  table  of  the  pos¬ 
sible  accent  patterns  and  a  pointer  to  a  decision  list, 
either  for  that  specific  word  or  its  ambiguity  class  (as 
determined  in  Step  6).  This  given  list  is  searched  for 
the  highest  ranking  match  in  the  word’s  context,  and 
a  classification  number  is  returned,  indicating  the  most 
likely  of  the  word’s  accent  patterns  given  the  context8. 

8If  all  entries  in  a  decision  list  fail  to  match  in  a  par¬ 
ticular  new  context,  a  final  entry  called  DEFAULT  is  used; 


92 


From  a  statistical  perspective,  the  evidence  at  the  top 
of  this  list  will  most  reliably  disambiguate  the  target 
word.  Given  a  word  in  a  new  context  to  be  assigned  an 
accent  pattern,  if  we  may  only  base  the  classification 
on  a  single  line  in  the  decision  list,  it  should  be  the 
highest  ranking  pattern  that  is  present  in  the  target 
context.  This  is  uncontroversial,  and  is  solidly  based  in 
Bayesian  decision  theory. 

The  question,  however,  is  what  to  do  with  the  less- 
reliable  evidence  that  may  also  be  present  in  the  target 
context.  The  common  tradition  is  to  combine  the  avail¬ 
able  evidence  in  a  weighted  sum  or  product.  This  is 
done  by  Bayesian  classifiers,  neural  nets,  IR-based  clas¬ 
sifiers  and  N-gram  part-of-speech  taggers.  The  system 
reported  here  is  unusual  in  that  it  does  no  such  combi¬ 
nation.  Only  the  single  most  reliable  piece  of  evidence 
matched  in  the  target  context  is  used.  For  example,  in 
a  context  of  cote  containing  poisson,  ports  and  atlan- 
tique,  if  the  adjacent  feminine  article  la  cote  (the  coast) 
is  present,  only  this  best  evidence  is  used  and  the  sup¬ 
porting  semantic  information  ignored.  Note  that  if  the 
masculine  article  le  cote  (the  side)  were  present  in  a  sim¬ 
ilar  maritime  context,  the  most  reliable  evidence  (gen¬ 
der  agreement)  would  override  the  semantic  clues  which 
would  otherwise  dominate  if  all  evidence  was  combined. 
If  no  gender  agreement  constraint  were  present  in  that 
context,  the  first  matching  semantic  evidence  would  be 
used. 

There  are  several  motivations  for  this  approach.  The 
first  is  that  combining  all  available  evidence  rarely  pro¬ 
duces  a  different  classification  than  just  using  the  single 
most  reliable  evidence,  and  when  these  differ  it  is  as 
likely  to  hurt  as  to  help.  In  a  study  comparing  results 
for  20  words  in  a  binary  homograph  disambiguation 
task,  based  strictly  on  words  in  local  (±4  word)  con¬ 
text,  the  following  differences  were  observed  between  an 
algorithm  taking  the  single  best  evidence,  and  an  other¬ 
wise  identical  algorithm  combining  all  available  match¬ 
ing  evidence:9 


Combining  vs.  Not  Combining  Probabilities 


Agree  -  Both  classifications  correct 
Both  classifications  incorrect 

92% 

6% 

Disagree  -  Single  best  evidence  correct 
Combined  evidence  correct 

1.3% 

0.7% 

Total  - 

100% 

Of  course  that  this  behavior  does  not  hold  for  all 
classification  tasks,  but  does  seem  to  be  characteristic 
of  lexically-based  word  classifications.  This  may  be  ex¬ 
plained  by  the  empirical  observation  that  in  most  cases, 
and  with  high  probability,  words  exhibit  only  one  sense 
in  a  given  collocation  (Yarowsky,  1993).  Thus  for  this 
type  of  ambiguity  resolution,  there  is  no  apparent  detri¬ 
ment,  and  some  apparent  performance  gain,  from  us- 

it  indicates  the  most  likely  accent  pattern  in  cases  where 
nothing  matches. 

9 In  cases  of  disagreement,  using  the  single  best  evidence 
outperforms  the  combination  of  evidence  65%  to  35%.  This 
observed  difference  is  1.9  standard  deviations  greater  than 
expected  by  chance  and  is  statistically  significant. 


ing  only  the  single  most  reliable  evidence  in  a  classifi¬ 
cation.  There  are  other  advantages  as  well,  including 
run-time  efficiency  and  ease  of  parallelization.  However, 
the  greatest  gain  comes  from  the  ability  to  incorporate 
multiple,  non-independent  information  types  in  the  de¬ 
cision  procedure.  As  noted  above,  a  given  word  in  con¬ 
text  (such  as  Castillos)  may  match  several  times  in  the 
decision  list,  once  for  its  parts  of  speech,  lemma,  capi¬ 
talized  and  capitalization-free  forms,  and  possible  word- 
classes  as  well.  By  only  using  one  of  these  matches,  the 
gross  exaggeration  of  probability  from  combining  all  of 
these  non-independent  log-likelihoods  is  avoided.  While 
these  dependencies  may  be  modeled  and  corrected  for 
in  Bayesian  formalisms,  it  is  difficult  and  costly  to  do 
so.  Using  only  one  log-likelihood  ratio  without  combi¬ 
nation  frees  the  algorithm  to  include  a  wide  spectrum  of 
highly  non-independent  information  without  additional 
algorithmic  complexity  or  performance  loss. 

EVALUATION 

Because  we  have  only  stripped  accents  artificially  for 
testing  purposes,  and  the  “correct”  patterns  exist  on¬ 
line  in  the  original  corpus,  we  can  evaluate  perfor¬ 
mance  objectively  and  automatically.  This  contrasts 
with  other  classification  tasks  such  as  word-sense  dis¬ 
ambiguation  and  part-of-speech  tagging,  where  at  some 
point  human  judgements  are  required.  Regrettably, 
however,  there  are  errors  in  the  original  corpus,  which 
can  be  quite  substantial  depending  on  the  type  of  ac¬ 
cent.  For  example,  in  the  Spanish  data,  accents  over 
the  i  (i)  are  frequently  omitted;  in  a  sample  test  3.7% 
of  the  appropriate  f  accents  were  missing.  Thus  the  fol¬ 
lowing  results  must  be  interpreted  as  agreement  rates 
with  the  corpus  accent  pattern;  the  true  percent  correct 
may  be  several  percentage  points  higher. 

The  following  table  gives  a  breakdown  of  the  differ¬ 
ent  types  of  Spanish  accent  ambiguities,  their  relative 
frequency  in  the  training  corpus,  and  the  algorithm’s 
performance  on  each:10 


Summary  of  Performance  on  Spanish: 


Ambiguous  Cases  (18%  of  tokens): 

Type 

Freq. 

Agreement 

Prior 

-oj-6 

81  % 

98  % 

'  86% " 

-ara/ -ara,-aran/-aran 

4  % 

92  % 

84% 

Function  Words 

13  % 

98  % 

94% 

Other 

2  % 

97  % 

95% 

Total 

98  % 

93% 

Unambiguous  Cases  (82%  of  tokens): 

! . HJIIIIi 

100  % 

100% 

Overall  Performance: 

99.6  % 

98.7% 

As  observed  before,  the  prior  probabilities  in  favor  of 
the  most  common  accent  pattern  are  highly  skewed,  so 
one  does  reasonably  well  at  this  task  by  always  using 
the  most  common  pattern.  But  the  error  rate  is  still 

10  The  term  prior  is  a  measure  of  the  baseline  performance 
one  would  expect  if  the  algorithm  always  chose  the  most 
common  option. 
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roughly  1  per  every  75  words,  which  is  unacceptably 
high.  This  algorithm  reduces  that  error  rate  by  over 
65%.  However,  to  get  a  better  picture  of  the  algorithm’s 
performance,  the  following  table  gives  a  breakdown  of 
results  for  a  random  set  of  the  most  problematic  cases 
-  words  exhibiting  the  largest  absolute  number  of  the 
non-majority  accent  patterns.  Collectively  they  consti¬ 
tute  the  most  common  potential  sources  of  error. 

Performance  on  Individual  Ambiguities 


Spanish; 


Pattern  1 

Pattern  2 

Agrmnt 

Prior 

N 

anuncio 

anuncio 

98.4% 

57% 

9459 

registro 

registro 

98.4% 

60% 

2596 

mar  co 

marco 

98.2% 

52% 

2069 

completo 

completo 

98.1% 

54% 

1701 

retiro 

retiro 

97.5% 

56% 

3713 

duro 

duro 

96.8% 

52% 

1466 

paso 

paso 

96.4% 

50% 

6383 

regalo 

regalo 

90.7% 

56% 

280 

terminara 

terminara 

82.9% 

59% 

218 

llegara 

llegara 

78.4% 

64% 

860 

deje 

deje 

89.1% 

68% 

313 

gane 

gane 

80.7% 

60% 

279 

secretaria 

secretaria 

84.5% 

52% 

1065 

seria 

seria 

r  97.7% 

93% 

1065 

hacia 

hacia 

97.3% 

91% 

2483 

esta 

esta 

97.1% 

61% 

14140 

mi 

mi 

93.7% 

82% 

1221 

French: 


cesse 

cesse 

97.7% 

53% 

1262 

decide 

decide 

96.5% 

64% 

3667 

laisse 

laisse 

95.5% 

50% 

2624 

commence 

commence 

95.2% 

54% 

2105 

cote 

cote 

98.1% 

69% 

3893 

traite 

traite 

95.6% 

71% 

2865 

Evaluation  is  based  on  the  corpora  described  in  the 
algorithm’s  Step  2.  In  all  experiments,  4/5  of  the  data 
was  used  for  training  and  the  remaining  1/5  held  out 
for  testing.  More  accurate  measures  of  algorithm  per¬ 
formance  were  obtained  by  repeating  each  experiment 
5  times,  using  a  different  1/5  of  the  data  for  each  test, 
and  averaging  the  results.  Note  that  in  every  experi¬ 
ment,  results  were  measured  on  independent  test  data 
not  seen  in  the  training  phase. 

It  should  be  emphasized  that  the  actual  percent  cor¬ 
rect  is  higher  than  these  agreement  figures,  due  to  errors 
in  the  original  corpus.  The  relatively  low  agreement 
rate  on  words  with  accented  i’s  (f)  is  a  result  of  this. 
To  study  this  discrepancy  further,  a  human  judge  fluent 
in  Spanish  determined  whether  the  corpus  or  decision 
list  algorithm  was  correct  in  two  cases  of  disagreement. 
For  the  ambiguity  case  of  mi/ mi,  the  corpus  was  incor¬ 
rect  in  46%  of  the  disputed  tokens.  For  the  ambiguity 
anuncio/  anuncio,  the  corpus  was  incorrect  in  56%  of 
the  disputed  tokens.  I  hope  to  obtain  a  more  reliable 
source  of  test  material.  However,  it  does  appear  that 
in  some  cases  the  system’s  precision  may  rival  that  of 
the  AP  Newswire’s  Spanish  writers  and  translators. 


DISCUSSION 

The  algorithm  presented  here  has  several  advantages 
which  make  it  suitable  for  general  lexical  disambigua¬ 
tion  tasks  that  require  integrating  both  semantic  and 
syntactic  distinctions.  The  incorporation  of  word  (and 
optionally  part-of-speech)  trigrams  allows  the  modeling 
of  many  local  syntactic  constraints,  while  collocational 
evidence  in  a  wider  context  allows  for  more  semantic 
distinctions.  A  key  advantage  of  this  approach  is  that 
it  allows  the  use  of  multiple,  highly  non-independent  ev¬ 
idence  types  (such  as  root  form,  inflected  form,  part  of 
speech,  thesaurus  category  or  application-specific  clus¬ 
ters)  and  does  so  in  a  way  that  avoids  the  complex 
modeling  of  statistical  dependencies.  This  allows  the 
decision  lists  to  find  the  level  of  representation  that  best 
matches  the  observed  probability  distributions.  It  is  a 
kitchen-sink  approach  of  the  best  kind  -  throw  in  many 
types  of  potentially  relevant  features  and  watch  what 
floats  to  the  top.  While  there  are  certainly  other  ways 
to  combine  such  evidence,  this  approach  has  many  ad¬ 
vantages.  In  particular,  precision  seems  to  be  at  least  as 
good  as  that  achieved  with  Bayesian  methods  applied 
to  the  same  evidence.  This  is  not  surprising,  given  the 
observation  in  (Leacock  et  ah,  1993)  that  widely  diver¬ 
gent  sense-disambiguation  algorithms  tend  to  perform 
roughly  the  same  given  the  same  evidence.  The  distin¬ 
guishing  criteria  therefore  become: 

•  How  readily  can  new  and  multiple  types  of  evidence 
be  incorporated  into  the  algorithm? 

•  How  easy  is  the  output  to  understand? 

•  Can  the  resulting  decision  procedure  be  easily  edited 
by  hand? 

•  Is  it  simple  to  implement  and  replicate,  and  can  it  be 
applied  quickly  to  new  domains? 

The  current  algorithm  rates  very  highly  on  all  these 
standards  of  evaluation,  especially  relative  to  some  of 
the  impenetrable  black  boxes  produced  by  many  ma¬ 
chine  learning  algorithms.  Its  output  is  highly  perspicu¬ 
ous:  the  resulting  decision  list  is  organized  like  a  recipe, 
with  the  most  useful  evidence  first  and  in  highly  read¬ 
able  form.  The  generated  decision  procedure  is  also 
easy  to  augment  by  hand,  changing  or  adding  patterns 
to  the  list.  The  algorithm  is  also  extremely  flexible  -  it 
is  quite  straightforward  to  use  any  new  feature  for  which 
a  probability  distribution  can  be  calculated.  This  is  a 
considerable  strength  relative  to  other  algorithms  which 
are  more  constrained  in  their  ability  to  handle  diverse 
types  of  evidence.  In  a  comparative  study  (Yarowsky, 
1994),  the  decision  list  algorithm  outperformed  both 
an  N-Gram  tagger  and  Bayesian  classifier  primarily  be¬ 
cause  it  could  effectively  integrate  a  wider  range  of 
available  evidence  types  than  its  competitors.  Although 
a  part-of-speech  tagger  exploiting  gender  and  number 
agreement  might  resolve  many  accent  ambiguities,  such 
constraints  will  fail  to  apply  in  many  cases  and  are  dif¬ 
ficult  to  apply  generally,  given  the  the  problem  of  iden¬ 
tifying  agreement  relationships.  It  would  also  be  at 
considerable  cost,  as  good  taggers  or  parsers  typically 
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involve  several  person-years  of  development,  plus  often 
expensive  proprietary  lexicons  and  hand-tagged  train¬ 
ing  corpora.  In  contrast,  the  current  algorithm  could 
be  applied  quite  quickly  and  cheaply  to  this  problem.  It 
was  originally  developed  for  homograph  disambiguation 
in  text-to-speech  synthesis  (Sproat  et  al.,  1992),  and 
was  applied  to  the  problem  of  accent  restoration  with 
virtually  no  modifications  in  the  code.  It  was  applied  to 
a  new  language,  French,  in  a  matter  of  days  and  with  no 
special  lexical  resources  or  linguistic  knowledge,  basing 
its  performance  upon  a  strictly  self-organizing  analysis 
of  the  distributional  properties  of  French  text.  The  flex¬ 
ibility  and  generality  of  the  algorithm  and  its  potential 
feature  set  makes  it  readily  applicable  to  other  prob¬ 
lems  of  recovering  lost  information  from  text  corpora;  I 
am  currently  pursuing  its  application  to  such  problems 
as  capitalization  restoration  and  the  task  of  recovering 
vowels  in  Hebrew  text. 

CONCLUSION 

This  paper  has  presented  a  general-purpose  algorithm 
for  lexical  ambiguity  resolution  that  is  perspicuous, 
easy  to  implement,  flexible  and  applied  quickly  to  new 
domains.  It  incorporates  class-based  models  at  sev¬ 
eral  levels,  and  while  it  requires  no  special  lexical  re¬ 
sources  or  linguistic  knowledge,  it  effectively  and  trans¬ 
parently  incorporates  those  which  are  available.  It  suc¬ 
cessfully  integrates  part-of-speech  patterns  with  local 
and  longer-distance  collocational  information  to  resolve 
both  semantic  and  syntactic  ambiguities.  Finally,  al¬ 
though  the  case  study  of  accent  restoration  in  Spanish 
and  French  was  chosen  for  its  diversity  of  ambiguity 
types  and  plentiful  source  of  data  for  fully  automatic 
and  objective  evaluation,  the  algorithm  solves  a  worth¬ 
while  problem  in  its  own  right  with  promising  commer¬ 
cial  potential. 
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